Langsmith: An Interactive Academic Text Revision System


Takumi Ito,1,2 Tatsuki Kuribayashi,1,2 Masatoshi Hidaka,3 Jun Suzuki,1,4 and Kentaro Inui1,4

1Tohoku University 2Langsmith Inc. 3Edge Intelligence Systems Inc. 4RIKEN


{t-ito, kuribayashi, jun.suzuki, inui}@ecei.tohoku.ac.jp hidaka@edgeintelligence.jp


Abstract

Despite the current diversity and inclusion initiatives in the academic community, researchers with a non-native command of English still face significant obstacles when writing papers in English. This paper presents the Langsmith editor, which assists inexperienced, non-native researchers to write English papers, especially in the natural language processing (NLP) field. Our system can suggest fluent, academic-style sentences to writers based on their rough, incomplete phrases or sentences. The system also encourages interaction between human writers and the computerized revision system. The experimental results demonstrated that Langsmith helps non-native English-speaker students write papers in English. The system is available at https://emnlp-demo.editor.langsmith.co.jp/.





  1. Introduction

Currently, diversity and inclusion in the natural language processing (NLP) community are encouraged. In fact, at the latest NLP conference at the time of writing1, papers were submitted from more than 50 countries. However, one obstacle can limit this diversity: the papers must be written in English. Writing papers in English can be a daunting task, especially for inexperienced, non-native speakers. These writers often struggle to put their ideas into words.

To address this problem, we built the Langsmith editor, an assistance system for writing NLP papers in English.2 The main feature in Langsmith is a revision function, which suggests fluent, academic-style sentences based on writers' rough, incomplete drafts.


The authors contributed equally.
1The 58th Annual Meeting of the Association for Computational Linguistics.
2See https://www.youtube.com/channel/UCjHeZPe0tT6bWxVVvum1bFQ for the screencast.

Figure 1: An overview of interactively writing texts with a revision system.



The drafts might be so rough that it is challenging to understand the user's intended meaning when they are used as inputs. In addition, several potentially plausible revisions can exist for a draft, especially when the input draft is incomplete.

Given these difficulties, our system provides two ways for users to customize the revision: users can (i) request specific revisions and (ii) select a suitable revision from diverse candidates (Figure 1). In particular, the request stage allows users to specify the parts that require intensive revision.

Our experiments demonstrate the effectiveness of our system. Specifically, students whose first language is Japanese, which differs greatly from English, managed to write better drafts when working with Langsmith.

Langsmith has other assistance features as well, such as text completion with a neural language model.


    Figure 2: Screenshot of Langsmith. The revision feature suggests various revisions for the input “Grammar error correction (GEC) () of automatically correcting errors made by a human writer in text.” The characters highlighted in green are added to the original sentence, and the red points indicate tracked deletions.


Furthermore, the communication between the server and the web frontend is achieved via a protocol specialized for writing software, the Text Editing Assistance Smartness Protocol for Natural Language (TEASPN) (Hagiwara et al., 2019). We hope that our system will help the NLP community and researchers, especially those lacking a native command of English.3

  2. Related work

    1. Natural language processing for academic writing

Academic writing assistance has gained considerable attention in NLP (Wu et al., 2010; Yimam et al., 2020; Lee and Webster, 2012), and several shared tasks have been organized (Dale and Kilgarriff, 2011; Daudaravičius, 2015). These tasks focus on polishing texts in already published articles or documents near completion. In contrast, this study focuses on revising texts in the earlier stages of writing (e.g., first drafts), where inexperienced, non-native authors might even struggle to convey their ideas accurately.

Ito et al. (2019) introduced a dataset and models for revising early-stage drafts and pointed out the one-to-many nature of such revisions. We tackled this difficulty by designing an overall demonstration system, including a user interface.

    2. Writing assistance tools

Error checkers. Grammar/spelling checkers are typical writing assistance tools. Some highlight errors (e.g., Write&Improve4), while others suggest corrections (e.g., Grammarly5, LanguageTool6, Ginger7, and LinggleWrite; see Tsai et al. (2020)) for writers.


      3This paper was also written using Langsmith.

      4writeandimprove.com


Langsmith has a revision feature (Ito et al., 2019) as well as a grammar/spelling checker. The revision feature suggests better versions of poorly written phrases or sentences in terms of fluency and style, whereas error checkers are typically designed to correct apparent errors only. In addition, Langsmith is specialized for the NLP domain and enables domain-specific revisions, such as correcting technical terms.

Text completion. Completing a text is another typical feature in writing assistance applications (WriteAhead8, Write With Transformer9, and Smart Compose; see Chen et al. (2019)). Our system also has a completion feature, which is specialized in academic writing (e.g., completing a text based on a section name).

  3. The Langsmith editor

    1. Overview

This section presents Langsmith, a web-based text editor for academic writing assistance (Figure 2). The system has the following three features: (i) text revision, (ii) text completion, and (iii) a grammatical/spelling error checker.


5https://www.grammarly.com
6https://languagetool.org
7https://www.gingersoftware.com
8writeahead.nlpweb.org
9https://transformer.huggingface.co

These features are activated when users select a text span, type a word, or press a special key.

As a case study, this work focuses on paper writing in the NLP domain. Thus, each assistance feature is specialized in the NLP domain. The following sections explain the details of each feature.

    2. Revision feature

      The revision feature, the main feature of Langsmith, suggests better sentences in terms of fluency and style for a given draft sentence (Figure 2). This feature is activated when the user selects a sentence or smaller unit.

Writers sometimes struggle to put their ideas into words. Thus, the input draft for the revision system can be incomplete or less informative. To handle this challenging situation, we adopt the REQUEST and SELECT framework to help users discover sentences that better match what they intended to write.

REQUEST stage. Langsmith provides two ways for users to request a specific revision, which can prevent unnecessary revisions from being presented to the user.

First, users can specify where the system should intensively revise a text.10 That is, when a part of a sentence is selected, the system intensively rephrases the words around the selected part.11 Figure 3 demonstrates the change of the revision focus depending on the selected text span. Note that controlling the revision focus was not explored in the original sentence-level revision task (Ito et al., 2019). This feature is also inspired by Grangier and Auli (2018).

Second, users can insert placeholder symbols, "()", at specific points in a sentence. The system revises the sentence by replacing the symbol with an expression that is appropriate for its context. The input for the revision in Figure 2 also contains the placeholder symbol; here, for example, the symbol is replaced with "the task." This feature is inspired by Zhu et al. (2019); Donahue et al. (2020); Ito et al. (2019).
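For illustration, the two request mechanisms can be mimicked with simple string manipulation. The "<?"/"?>" edit marks and the "()" placeholder follow the formats shown in Figure 2 and Appendix A; the helper functions below are hypothetical and only sketch how such inputs might be built, not the system's actual preprocessing.

```python
# A minimal sketch of serializing REQUEST-stage inputs for the revision model.
# The "<?" / "?>" edit marks and the "()" placeholder symbol come from the paper;
# the function names and exact concatenation are hypothetical.

def mark_selected_span(sentence: str, start: int, end: int) -> str:
    """Wrap the character span [start, end) with edit marks so that the model
    intensively rephrases the words around it."""
    return sentence[:start] + "<? " + sentence[start:end] + " ?>" + sentence[end:]

def with_placeholder(prefix: str, suffix: str) -> str:
    """Build an input whose missing part should be filled in by the model."""
    return f"{prefix} () {suffix}".strip()

if __name__ == "__main__":
    sent = "This formulation of the input and output promotes human-computer interaction."
    start = sent.index("promotes")
    print(mark_selected_span(sent, start, start + len("promotes")))
    # -> This formulation of the input and output <? promotes ?> human-computer interaction.
    print(with_placeholder("Grammar error correction (GEC)", "of automatically correcting errors."))
```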

SELECT stage. The system provides several revisions (Figure 2). Note that there is typically more than one plausible revision in terms of fluency and style, in contrast to correcting surface-level errors (Napoles et al., 2017).


10The system performs sentence-level revisions. Hence, users are instructed to select a span that does not cross sentence boundaries.
11We allow the system to correct parts outside the selected span because the revision of a specific part sometimes requires adjustments to the other parts.


Figure 3: The focus of the revision depends on the parts selected by users. (1) Revisions focusing on "This formulation · · · and output." (2) Revisions focusing on "promote." (3) Revisions focusing on "human–computer interaction."



The diversity of the output revisions is encouraged using diverse beam search (Vijayakumar et al., 2018). In addition, these revisions are ordered by a language model that is fine-tuned for NLP papers. That is, revisions with lower perplexity are listed in the upper part of the suggestion box. Furthermore, the revisions are highlighted in colors, which makes it easier to distinguish the characteristics of each revision.
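The ordering step can be sketched as follows. This is a minimal illustration that scores candidates with a stock GPT-2 checkpoint from the Transformers library; the system itself uses a language model fine-tuned on NLP papers, and the candidate sentences below are invented.

```python
# Minimal sketch of ordering candidate revisions by language-model perplexity.
# Assumes the Transformers library and a generic GPT-2 checkpoint, not the
# in-domain model used by Langsmith.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood per token
    return torch.exp(loss).item()

candidates = [
    "We observed significant differences in the results between A and B.",
    "We saw difference in the results between A and B.",
]
# Lower perplexity first, i.e., listed in the upper part of the suggestion box.
for ppl, cand in sorted((perplexity(c), c) for c in candidates):
    print(f"{ppl:7.2f}  {cand}")
```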

Implementation. We trained a revision model using LightConv (Wu et al., 2019) implemented in Fairseq (Ott et al., 2019). The revision model generates a sentence based on a given input sentence. The model was trained on a slightly modified version of the synthetic training data used in Ito et al. (2019).



Figure 4: An example of the completion feature. These suggestions are conditioned on the left context, the section name (Related work), and the paper title (Better Models for Grammatical Error Correction).




      Figure 5: The interface of the error correction feature. Errors are automatically highlighted with a red line. The corrections are suggested when the user hovers over the highlighted words.


As an example of these modifications, synthetic edit marks were added for a subset of the training data. These marks were attached to the part of an input sentence that has many edits compared with its reference.12 Thus, the marks can provide a hint for the system to determine where to edit. When using Langsmith, the marks are attached to the span selected by the user, and the system is expected to intensively revise the wording in the specified span. Details are in Appendix A.

    3. Other features

Completion feature. When the user presses the Tab key, the completion feature generates plausible continuations from the cursor position (Figure 4). This feature can consider the paper title and section name as well as the text to the left of the cursor.

We used GPT-2 small (117M) (Radford et al., 2019) fine-tuned on papers collected from the ACL Anthology13. Paper titles and section names were concatenated at the beginning of the corresponding paragraphs in the fine-tuning data. Details are in Appendix B.


      12Special symbols are attached at the beginning and the end of the specific subsequence.

      13https://www.aclweb.org/anthology

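A rough sketch of such conditioned completion is shown below. It assumes a stock GPT-2 checkpoint from Transformers rather than the fine-tuned model; the "@ Title @" framing follows Table 5 in the appendix, but the exact separators between the section name and the paragraph are an assumption.

```python
# A sketch of conditioning the completion on the paper title and section name.
# A stock GPT-2 checkpoint stands in for the model fine-tuned on ACL Anthology
# papers, and the prompt layout after the title line is an assumption.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = (
    "@ Better Models for Grammatical Error Correction @\n"
    "Related work\n"
    "Grammatical error correction has been studied"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,          # nucleus sampling, as in the paper
    top_p=0.97,
    max_new_tokens=30,
    num_return_sequences=3,  # show several candidate continuations
    pad_token_id=tokenizer.eos_token_id,
)
for seq in outputs:
    print(tokenizer.decode(seq[inputs.input_ids.shape[1]:], skip_special_tokens=True))
```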

Error correction feature. We used LanguageTool,14 an open-source grammatical/spelling error correction tool. This feature is invoked each time the text changes. The detected errors are then automatically highlighted with red lines (Figure 5). The corrections are listed when the user hovers over the highlighted words.
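Langsmith talks to its own LanguageTool instance. As a stand-in, roughly the same checks can be reproduced with the language_tool_python wrapper, as in the sketch below; the wrapper and its attribute names are an assumption about tooling, not part of the system.

```python
# Illustrative only: Langsmith runs its own LanguageTool 3.2 instance, but a
# similar check can be reproduced with the language_tool_python wrapper.
import language_tool_python

tool = language_tool_python.LanguageTool("en-US")
text = "We saw difference in the results between A and B."

for match in tool.check(text):
    span = text[match.offset:match.offset + match.errorLength]
    print(f"{span!r}: {match.message} -> {match.replacements[:3]}")
```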

    4. Protocol

Langsmith was developed based on the TEASPN Software Development Kit (Hagiwara et al., 2019).15 TEASPN defines a set of APIs for writing software (e.g., text editors) to communicate with servers that implement NLP technologies (e.g., a revision model). We extended the protocol to convey title and section information for the completion feature. Since Langsmith is a browser-based tool and frequently communicates with a web server running the models, we used WebSocket to achieve smooth communication.
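The transport can be illustrated with a toy WebSocket server. The sketch below only shows the request/response round trip with a made-up JSON payload; it does not implement the actual TEASPN message schema.

```python
# A toy sketch of the client-server transport only: a WebSocket server that
# receives JSON requests from the editor frontend and returns suggestions.
# The payload here is invented, not the TEASPN schema. Assumes the `websockets`
# package (version 10.1 or later, where handlers take a single argument).
import asyncio
import json
import websockets

async def handle(websocket):
    async for message in websocket:
        request = json.loads(message)
        # A real server would run the revision/completion models here.
        response = {"id": request.get("id"), "suggestions": ["..."]}
        await websocket.send(json.dumps(response))

async def main():
    async with websockets.serve(handle, "localhost", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```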

  4. Experiments and results

We demonstrate the effectiveness of human–machine interactions in revising drafts implemented in our system. We also check whether the REQUEST stage in the revision feature works adequately.

    1. On the revised draft quality

Settings. We suppose a situation where a person writes a draft in their native (non-English) language, translates it into English, and then revises it further to create an English-language draft. To simulate this situation, we first collected the Japanese-language versions of the abstract sections from eight Japanese peer-reviewed journals.16 The abstracts were then translated into English with an off-the-shelf translation system17. We considered the translated abstracts as first drafts; the task is to revise these first drafts. Expert translators created reference final drafts from the Japanese versions of the drafts.18 We evaluated the quality of the revised versions by comparing them with the corresponding final drafts.


14https://github.com/languagetool-org/languagetool/releases/tag/v3.2
15https://github.com/teaspn/teaspn-sdk
16We used the journals accepted at https://www.anlp.jp/en/index.html.
17https://translate.google.co.jp
18We used https://www.ulatus.com/.


Condition         BLEURT
Human&Machine      -0.45
Human-only         -0.51
Machine-only       -0.51
First drafts       -0.70

Table 1: Comparison of the revision quality. The scores are averaged over the corresponding revisions. Higher scores indicate that the drafts are closer to the final drafts.


      We compared three versions of revised drafts to evaluate the effectiveness of Langsmith:

• one fully and automatically revised by Langsmith (MACHINE-ONLY revision),
• one revised by a human writer without Langsmith (HUMAN-ONLY revision), and
• one revised by a human writer using the assistance features in Langsmith (HUMAN&MACHINE revision).
The following paragraphs explain how we obtained the above three versions of the revisions. Appendix C shows the statistics of the drafts.

MACHINE-ONLY revision. We automatically applied the revision feature to each sentence of the drafts without the REQUEST and SELECT stages. For each sentence, the revision with the highest generation probability was selected.19 We created one MACHINE-ONLY revision for each first draft.

HUMAN-ONLY revision. Human writers revised a given first draft. The writers could only access the error correction feature. This setting simulates the situation that writers typically face.

HUMAN&MACHINE revision. Human writers revised a given first draft with full access to the Langsmith features.

Human writers. We asked 16 undergraduate and master's students at an NLP laboratory to revise the first drafts in terms of fluency and style. The students were native Japanese speakers, representative of inexperienced researchers in a country whose spoken language is considerably different from English. Each participant revised two different first drafts, one under the HUMAN-ONLY setting and the other under the HUMAN&MACHINE setting.


19The hyperparameters for decoding revisions were the same as those of the revision feature in Langsmith. Re-ranking with the language model was also employed.


                 

        Strongly  Slightly  Slightly  Strongly
Q.      agree     agree     disagree  disagree
(I)     87.5      12.5      0         0
(II)    50.0      50.0      0         0
(III)   62.5      31.3      6.3       0
(IV)    12.5      50.0      31.3      6.3
(V)     75.0      12.5      6.3       6.3
(VI)    43.8      43.8      12.5      0

Table 2: Results of the user study about (I)-(VI). The scores denote the percentage of the participants who chose each option.


Half of the participants first revised a draft with the HUMAN-ONLY setting and then revised another draft with the HUMAN&MACHINE setting; the other half performed the same task in the opposite order. Ultimately, we collected two HUMAN&MACHINE revisions and two HUMAN-ONLY revisions for each first draft.

Comparison and results. We compared the quality of the three versions of the revised drafts: MACHINE-ONLY, HUMAN-ONLY, and HUMAN&MACHINE. We compared each revised draft with its corresponding final draft using BLEURT (Sellam et al., 2020), a state-of-the-art automatic evaluation metric for natural language generation tasks. Details of the evaluation procedure are given in Appendix D. Note that the score is not in the range [0, 1]; a higher score means that the revision is closer to the final draft. Table 1 shows that the HUMAN&MACHINE revisions were significantly better20 than the MACHINE-ONLY and HUMAN-ONLY revisions. The results suggest the effectiveness of the human–machine interaction achieved in Langsmith. Since this experiment was relatively small in scale and only used an automatic evaluation metric, we will conduct a larger-scale experiment with human evaluations in the future.

    2. User study

After the experiments outlined in Section 4.1, we asked the participants about the usability of Langsmith. The 16 participants were instructed to evaluate the following statements:

      1. Langsmith was more helpful than the Baseline environment for the revision task.

        20We applied a bootstrap hypothesis test (Koehn, 2004), and the score of HUMAN&MACHINE was significantly higher than the HUMAN-ONLY and MACHINE-ONLY scores (p < 0.05).

Feature      Percentage
revision     100
completion   31.3
correction   62.5

Table 3: Results of the user study about helpful features. The scores denote the percentage of the participants who chose the feature (multiple-choice question).

2. Comparing the text written by the two environments, the text written with Langsmith was better.

      3. The feature of specifying where to intensively revise was helpful.

      4. The placeholder feature in the revision feature was helpful.

5. Providing more than one output from the revision feature was helpful.

6. Providing more than one output from the completion feature was helpful.

The participants evaluated statements (I)-(VI) on a four-point scale: (a) strongly agree, (b) slightly agree, (c) slightly disagree, and (d) strongly disagree. In addition, the participants answered whether each feature was helpful in writing.

Results. Tables 2 and 3 show the results of our user study. From the responses to (I) and (II), we observed that the users were satisfied with the writing experience with Langsmith. The responses to (III), (IV), and (V) support the idea that our REQUEST and SELECT stages are helpful, although the placeholder feature was found relatively less helpful. The responses to (VI) also suggest that showing several candidates does not bother the users. Table 3 displays whether each feature was helpful in writing; the result indicates that the revision feature was the most useful of the implemented features for creating drafts.

    3. Sanity check of the REQUEST stage

      Finally, we checked the validity of our method to control the revision based on the selected part of the sentence (Figure 3).

Settings. We randomly collected 1,000 sentences from the first drafts created with the translation system. For each sentence with T tokens x = (w1, · · · , wT), we randomly inserted edit marks to specify a span s = (i, j) in x (1 ≤ i < j ≤ T, 1 ≤ j − i ≤ 5). Specifically, special tokens were inserted before wi and after wj in x. We denote the input sentence with these edit marks as x^edit. We then obtained the 10-best outputs of the revision system (y_1^edit, · · · , y_10^edit) for each x^edit. These output sentences were generated through diverse beam search with the same settings as the revision feature in Langsmith. We calculated the following score for each input sentence and its revisions:

$$ r = \left|\left\{\, y^{\mathrm{edit}}_k \;\middle|\; x_{i:j} \in \mathrm{ngram}(y^{\mathrm{edit}}_k),\ 1 \le k \le 10 \,\right\}\right| $$

where x_{i:j} denotes the subsequence (wi, · · · , wj) in x, and ngram(·) returns the set of all n-grams of a given sequence. A lower r indicates that the subsequence specified with the edit marks is more frequently rephrased.

We also obtained a score r_l for each x; r_l was calculated using the input without the edit marks and its 10-best outputs y_k. We compared r and r_l for each x.
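A small sketch of the score computation, with made-up token sequences:

```python
# A small sketch of the r score defined above: count how many of the 10 outputs
# still contain the marked subsequence x_{i:j} as an n-gram. The token lists and
# outputs here are invented for illustration.
def ngrams(tokens):
    """All n-grams (as tuples) of a token sequence, for n = 1 .. len(tokens)."""
    return {tuple(tokens[a:a + n])
            for n in range(1, len(tokens) + 1)
            for a in range(len(tokens) - n + 1)}

def r_score(marked_span, outputs):
    """Number of outputs whose n-gram set still contains the marked span."""
    span = tuple(marked_span)
    return sum(1 for y in outputs if span in ngrams(y))

span = ["saw", "difference"]
outputs = [
    ["We", "observed", "significant", "differences", "."],
    ["We", "saw", "difference", "in", "the", "results", "."],
]
print(r_score(span, outputs))  # -> 1: the span survives in one of the outputs
```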

Results. We observed that r frequently21 had lower values than r_l. That is, a certain subsequence was more frequently rephrased by the revision system when it had the edit marks than when it did not. These results validate our approach of controlling the revision focus, which is implemented in the REQUEST stage of the revision feature.

  5. Conclusions

We have presented Langsmith, an academic writing assistance system. Langsmith provides a writing environment in which human writers use several assistance features to improve the quality of their texts. Our experiments suggest that our system is useful for inexperienced, non-native writers revising English-language papers. We are aware that our experimental settings were not ideal (e.g., we had only Japanese participants and no human evaluation); we will evaluate Langsmith in more sophisticated settings. We hope that our system contributes to breaking language barriers in the academic community.

Acknowledgement

We are grateful to Ana Brassard for her feedback on English. We also appreciate the participants of our user studies. This work was supported by Grant-in-Aid for JSPS Fellows Grant Number JP20J22697.


21We conducted a one-sided sign test. The difference is significant with p < 0.05.

References

Mia Xu Chen, Benjamin N. Lee, Gagan Bansal, Yuan Cao, Shuyuan Zhang, Justin Lu, Jackie Tsay, Yinan Wang, Andrew M. Dai, Zhifeng Chen, Timothy Sohn, and Yonghui Wu. 2019. Gmail smart compose: Real-time assisted writing. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '19), pages 2287–2295.

Robert Dale and Adam Kilgarriff. 2011. Helping our own: The HOO 2011 pilot shared task. In Proceedings of the 13th European Workshop on Natural Language Generation (ENLG 2011), pages 242–249.

Vidas Daudaravičius. 2015. Automated evaluation of scientific writing: AESW shared task proposal. In Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2015), pages 56–63.

Chris Donahue, Mina Lee, and Percy Liang. 2020. Enabling language models to fill in the blanks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), pages 2492–2501.

David Grangier and Michael Auli. 2018. QuickEdit: Editing text & translations by crossing words out. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2018), pages 272–282.

Masato Hagiwara, Takumi Ito, Tatsuki Kuribayashi, Jun Suzuki, and Kentaro Inui. 2019. TEASPN: Framework and protocol for integrated writing assistance environments. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing: System Demonstrations (EMNLP-IJCNLP 2019), pages 229–234.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020).

Takumi Ito, Tatsuki Kuribayashi, Hayato Kobayashi, Ana Brassard, Masato Hagiwara, Jun Suzuki, and Kentaro Inui. 2019. Diamonds in the rough: Generating fluent sentences from early-stage drafts for academic writing assistance. In Proceedings of the 12th International Conference on Natural Language Generation (INLG 2019), pages 40–53.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), pages 388–395.

John Lee and Jonathan Webster. 2012. A corpus of textual revisions in second language writing. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), pages 248–252.

Courtney Napoles, Keisuke Sakaguchi, and Joel Tetreault. 2017. JFLEG: A fluency corpus and benchmark for grammatical error correction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017), pages 229–234.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: System Demonstrations (NAACL 2019), pages 48–53.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), pages 7881–7892.

Chung-Ting Tsai, Jhih-Jie Chen, Ching-Yu Yang, and Jason S. Chang. 2020. LinggleWrite: A coaching system for essay writing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations (ACL 2020), pages 127–133.

Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. 2018. Diverse beam search: Decoding diverse solutions from neural sequence models. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI 2018), pages 7371–7379.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. 2019. Pay less attention with lightweight and dynamic convolutions. In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019).

Jian-Cheng Wu, Yu-Chia Chang, Teruko Mitamura, and Jason S. Chang. 2010. Automatic collocation suggestion in academic writing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics Conference Short Papers (ACL 2010), pages 115–119.

Seid Muhie Yimam, Gopalakrishnan Venkatesh, John Lee, and Chris Biemann. 2020. Automatic compilation of resources for academic writing and evaluating with informal word identification and paraphrasing system. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020), pages 5896–5904.

Wanrong Zhu, Zhiting Hu, and Eric Xing. 2019. Text infilling. arXiv preprint arXiv:1901.00158.

  A. Details on revision model

Data. We trained the revision model using a slightly modified version of the synthetic training data introduced in Ito et al. (2019). They created several types of synthetic training data with several noising methods: (i) a heuristic noising method, (ii) grammatical error generation, (iii) style removal, and (iv) entailed sentence generation. We used the data created by the heuristic noising method, style removal, and entailed sentence generation for training the revision model. Note that we did not use the data generated by grammatical error generation because the grammatical error correction feature is implemented separately from the revision feature in Langsmith.

We attached the edit marks to the subpart of the training data generated by the style removal method. Let x_{1:N} = (x1, x2, · · · , xN) and y_{1:M} = (y1, y2, · · · , yM) be an input sentence with N tokens and its revision with M tokens, respectively. Here, x is the synthetic draft sentence generated by the style removal method from y. The training dataset consists of the pairs (x, y).

For each (x, y), we first determined whether each word in x was rewritten compared with y. We assumed that a token xi ∈ x was rewritten if a token with the same lemma as xi was not in {yj | max(0, i − 3) ≤ j ≤ min(M, i + 3)}. We thus obtained a sequence c ∈ {0, 1}^N, where each element ci indicates whether the token xi was rewritten: ci is 1 if xi was rewritten in y; otherwise, ci is 0. Then, we defined a score r(c) for each (x, y) as follows:

$$ r(c) = \frac{\sum_{i=1}^{N} c_i}{|c|} $$

where |·| returns the length of the vector. If r(c) > 0.4, we did not attach the edit marks. When r(c) ≤ 0.4, we obtained a span s = (a, b) for x and c as follows:

$$ s = \operatorname*{argmax}_{(a, b) \in S} \left( \sum_{i=a}^{b} c'_i - \sum_{i=0}^{a-1} c'_i - \sum_{i=b+1}^{N+1} c'_i \right), \qquad c'_i = \begin{cases} 10 & (c_i = 1) \\ 0 & (i = 0, N+1) \\ -1 & (\text{otherwise}) \end{cases} $$

where S = {(a, b) | a, b ∈ {1, · · · , N}, a ≤ b}. Based on the obtained s = (a, b), we inserted <? before the token xa and ?> after the token xb, and included the data with these special symbols in the training data.

When the user selects a subsequence of a sentence in Langsmith, the edit marks are attached to the input sentence in the same way. For example, if the user selects the span "promotes" in the sentence "This formulation of the input and output promotes human-computer interaction.", the input to the revision feature is formatted as follows: This formulation of the input and output <? promotes ?> human-computer interaction.
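A compact sketch of this procedure is given below. The lemmatization is approximated by lowercasing, and the edge handling and tie-breaking are assumptions, so it illustrates the scoring rather than reproducing the exact preprocessing.

```python
# A compact sketch of the mark-attachment procedure described above. Lemmas are
# approximated by lowercasing; the original presumably used a proper lemmatizer.
def rewritten_flags(x, y, window=3):
    """c_i = 1 if no token sharing x_i's lemma appears in y within a +/-3 window."""
    lemma = lambda w: w.lower()
    flags = []
    for i, tok in enumerate(x):
        lo, hi = max(0, i - window), min(len(y), i + window + 1)
        flags.append(0 if lemma(tok) in {lemma(w) for w in y[lo:hi]} else 1)
    return flags

def select_span(c):
    """argmax over spans (a, b) of the inside-minus-outside score, with
    weight 10 for rewritten tokens and -1 otherwise (boundaries contribute 0)."""
    weights = [10 if ci == 1 else -1 for ci in c]
    best, best_score = None, float("-inf")
    for a in range(len(c)):
        for b in range(a, len(c)):
            inside = sum(weights[a:b + 1])
            outside = sum(weights[:a]) + sum(weights[b + 1:])
            if inside - outside > best_score:
                best, best_score = (a, b), inside - outside
    return best

def attach_marks(x, y, threshold=0.4):
    c = rewritten_flags(x, y)
    if sum(c) / len(c) > threshold:   # r(c) > 0.4: leave the sentence unmarked
        return x
    a, b = select_span(c)
    return x[:a] + ["<?"] + x[a:b + 1] + ["?>"] + x[b + 1:]
```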


Model. Table 4 shows the hyperparameters of the revision model. In the decoding phase, we used diverse beam search (Vijayakumar et al., 2018). The beam size is set to 15, and the diverse beam groups and diverse beam strength are 15 and 1.0, respectively.

Specifically, we first obtained the top-15 hypotheses, and these hypotheses were then re-ranked by the language model. Here, the language model considers 20 tokens of left context and 20 tokens of right context beyond the sentence. We excluded hypotheses whose perplexity was greater than 1.3 times the perplexity of the input. We finally showed the top-8 re-ranked revisions to the users. The language model used for re-ranking is the same as the model used for the completion feature (Appendix B).


  B. Details on completion model

Data. We collected 234,830 PDFs of the papers published in the ACL Anthology by 2019. We used GROBID22 to extract the text from the PDF files. The training data is formatted as shown in Table 5. The title is omitted with 20% probability, and the order of the sections within the same paper was shuffled.

Model. We used a pre-trained GPT-2 small (117M). Table 6 shows the hyperparameters for fine-tuning the pre-trained GPT-2. We used the implementation in Transformers (Wolf et al., 2019). We used nucleus sampling (Holtzman et al., 2020) with p = 0.97 to generate the texts.

  C. Statistics of the drafts

Table 7 shows the statistics of the drafts collected in Section 4. The column "word types" shows the number of token types used in the drafts.

22https://github.com/kermitt2/grobid

Fairseq model architecture: lightconv_iwslt_de_en

Optimizer
  algorithm: Adam
  learning rate: 5e-4
  adam epsilon: 1e-08
  adam betas: (0.9, 0.98)
  weight decay: 0.0001
  clip norm: 0.0

Learning rate scheduler
  type: inverse_sqrt
  warmup updates: 4,000
  warmup init learning rate: 1e-7
  min learning rate: 1e-9

Training
  batch size: 24,000 tokens
  updates: 1,050,530 steps

Table 4: Hyperparameters of the revision feature.


    @ Title @



    @ Title (of another paper) @

    · · ·

Table 5: The format of the training data for the completion model.


  D. Details on the evaluation in Section 4.1

We used BLEURT-Base with 128 max tokens.23 BLEURT is designed to evaluate the similarity of a given sentence pair. Thus, we first split each draft into sentences, and each sentence in the first drafts was aligned with the most similar sentence in the corresponding final draft. Sentence splitting and sentence alignment were performed with spaCy.24 Note that the references were created so that the sentence segmentation does not change from the original first draft. Finally, we scored each sentence pair with BLEURT and averaged the results.


23https://storage.googleapis.com/bleurt-oss/bleurt-base-128.zip
24Sentence similarity is computed using the cosine similarity of averaged word vectors. We used spaCy's en_core_web_lg model.
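The procedure can be sketched as follows, assuming the bleurt Python package and a locally downloaded BLEURT-Base-128 checkpoint (the path below is a placeholder); for simplicity, the revised-draft sentences are aligned directly against the final draft here.

```python
# A sketch of the evaluation: split drafts into sentences with spaCy, align each
# candidate sentence to the most similar reference sentence (cosine similarity of
# averaged word vectors via en_core_web_lg), then average sentence-level BLEURT
# scores. Assumes the `bleurt` package and a downloaded checkpoint directory.
import spacy
from bleurt import score as bleurt_score

nlp = spacy.load("en_core_web_lg")
scorer = bleurt_score.BleurtScorer("bleurt-base-128")  # placeholder checkpoint path

def evaluate_draft(revised_draft: str, final_draft: str) -> float:
    revised = list(nlp(revised_draft).sents)
    final = list(nlp(final_draft).sents)
    candidates, references = [], []
    for sent in revised:
        best = max(final, key=lambda ref: sent.similarity(ref))
        candidates.append(sent.text)
        references.append(best.text)
    scores = scorer.score(references=references, candidates=candidates)
    return sum(scores) / len(scores)
```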



Model architecture: gpt2

Optimizer
  algorithm: Adam
  learning rate: 5e-5
  adam epsilon: 1e-8
  adam betas: (0.9, 0.999)
  weight decay: 0.0
  clip norm: 1.0

Learning rate scheduler
  type: linear
  warmup updates: 0
  max learning rate: 5e-5
  total epochs (used only for scheduling): 100

Training
  batch size: 262,144 tokens
  updates: 138,300 steps

Table 6: Hyperparameters for fine-tuning LMs.


Drafts                      Length      Word types
Final drafts (reference)    199 ± 52    108 ± 17
Human&Machine               192 ± 40    101 ± 17
Human-only                  192 ± 43    100 ± 16
Machine-only                199 ± 58    105 ± 22
First drafts                202 ± 56    104 ± 22

Table 7: Statistics of the drafts. The scores are averaged over the drafts. The values following "±" denote the standard deviation of the scores.